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Joint genotyping on the fly: Identifying variation 
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High-throughput sequencing is enabling remarkably deep surveys of genomic variation. It is now possible to completely 
sequence multiple individuals from a single species, yet the identification of variation among them remains an evolving 
computational challenge. This challenge is compounded for experimental organisms when strains are studied instead of 
individuals. In response, we present the joint Genotyper for Inbred Lines []G\L] as a method for obtaining genotypes and 
identifying variation among a large panel of inbred strains or lines. JGIL inputs the sequence reads from each line after 
their alignment to a common reference. Its probabilistic model includes site-specific parameters common to all lines that 
describe the frequency of nucleotides segregating in the population from which the inbred panel was derived. The 
distribution of line genotypes is conditional on these parameters and reflects the experimental design. Site-specific error 
probabilities, also common to all lines, parameterize the distribution of reads conditional on line genotype and realized 
coverage. Both sets of parameters are estimated per site from the aggregate read data, and posterior probabilities are 
calculated to decode the genotype of each line. We present an application of JGIL to 162 inbred Drosophila melanogaster lines 
from the Drosophila Genetic Reference Panel. We explore by simulation the effect of varying coverage, sequencing error, 
mapping error, and the number of lines. In doing so, we illustrate how JGIL is robust to moderate levels of error. Supported 
by these analyses, we advocate the importance of modeling the data and the experimental design when possible. 



[Supplemental material is available for this article.] 

With recent advances in high-throughput sequencing technology, 
sequencing a genome can be accomplished affordably and with 
great speed. Consequently, it has become possible to survey geno- 
mic variation at unprecedented resolution, and many such studies 
are under way (e.g., The 1000 Genomes Project Consortium 2010; 
Li et al. 2010; Yi et al. 2010). Despite the ease with which massive 
quantities of sequence reads can now be obtained, interpretation 
of the data is nontrivial. Errors in sequencing and assembly may 
masquerade as genomic variation, and biases in these processes 
can complicate the resolution of heterozygous genotypes. By ne- 
cessity, novel analysis tools are being developed with these chal- 
lenges in mind. 

In particular, for population genomic studies in which mul- 
tiple individuals are sequenced, joint analysis has proven to be 
a successful strategy for resolving genotypes and identifying seg- 
regating genetic variation (The 1000 Genomes Project Consortium 
2010; Le and Durbin 2010; DePristo et al. 2011; Li 2011). Provided 
that the sequenced individuals are related in some way (e.g., 
a family or a population sample), joint inference can be used to 
leverage the dependence between individual genotypes. The ra- 
tionale behind joint inference is intuitive: Knowing that one in- 
dividual has an A allele increases the likelihood that a relative, 
however distant, also has an A. This feature is particularly useful 
when many individuals are sequenced at low coverage, because the 
aggregate coverage still permits an accurate description of variation 
at the population level. Knowledge of the population, in turn, may 
improve the accuracy with which each individual's genotype can 
be resolved. 

Whereas population genomics considers individuals within 
a population, cancer genomics targets a population of cells within 
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an individual. For example, the latter might apply high-through- 
put sequencing technology to a tumor sample toward identifying 
variation with respect to the normal genotype of an affected in- 
dividual. Although the technologies facilitating population and 
cancer genomics are largely the same, each field relies on distinct 
tools for data analysis. In particular, methods specific to cancer 
genomics must respect the fact that allele frequencies within a so- 
matic tumor sample need not follow the germline expectation of 
strict homozygosity (0%/100%) or heterozygosity (50%). Several 
approaches doing so are in use already (Koboldt et al. 2009; Goya 
et al. 2010), with several more reported to be under development 
(Ding et al. 2010). 

Each of the methods referenced above shares a common 
motivation in applications to human genomics; however, high- 
throughput sequencing technology is transforming the genomic 
studies of other species as well. Sometimes the methods developed 
in response to human studies apply equally well to other organ- 
isms, but this is not always the case. For example, an often dis- 
tinguishing characteristic of non-human genomics is that a strain 
is studied rather than a single individual. By design, the individuals 
comprising a strain are genetically very similar, giving meaning to 
the concept of a strain's genotype. On the other hand, in most 
cases, a strain is not isogenic, implying that some degree of genetic 
variation remains. In that sense, a strain is not unlike a tumor 
sample, because while the strain is viewed as an individual, there is 
a population that underlies it. 

It is becoming increasingly common for multiple strains to be 
studied simultaneously, for example, to map the quantitative trait 
loci responsible for a particular phenotype (Aylor et al. 2011; Cao 
et al. 2011). A prerequisite for such studies is to identify and char- 
acterize genetic variation, which is consistent with the goals of 
population genomics. High-throughput sequencing promises the 
unbiased investigation of large strain panels, but only once each 
strain's genotype has been accurately resolved. Following the logic 
of human population studies, if the strains comprising a panel are 
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related in some way, there should be some benefit in simulta- 
neously decoding the genotypes of each strain. But, following the 
lesson of cancer genomics, it must be appreciated that a strain is 
a population and not an individual. 

This study considers the problem of joint genotypic inference 
on a large panel of strains. The work was motivated by challenges 
encountered in the Drosophila Genetic Reference Panel (DGRP) 
(Mackay et al. 2012), a collection of 192 inbred D. melanogaster lines 
("strains"). The DGRP lines were derived from a common Raleigh, 
NC population by 20 generations of full-sib mating (Fig. 1). The 
offspring of the sib pair at generation 20 and their descendants in 
perpetuity comprise the inbred line. Sequencing of the DGRP lines 
was done primarily on the Illumina GAII platform, and crucially 
what was sequenced for each line is DNA from a large pool of flies 
(between 500 and 1000 flies) (see Fig. 2). From these DNA pools, we 
sought to obtain genotypes simultaneously for each inbred line 
while respecting the fact that each line is itself a population. Be- 
cause we know precisely how the lines are related, we were able to 
incorporate the DGRP experimental design. In what follows we 
describe our probabilistic model and its implementation as the 
Joint Genotyper for Inbred Lines (JGIL). We demonstrate howJGIL 
performs on data from the DGRP, and we report the results of 
a simulation study designed to interrogate how performance varies 
with coverage, sequencing error rate, mapping error rate, and the 
number of lines in the analysis. 

Genotypes for inbred lines 

The effect of inbreeding is to reduce the genetic variation within 
each DGRP line so that the majority of sites are fixed. At such 
positions, the genotype of the line is simply the genotype of any 
one individual; however, for sites that harbor residual variation, 
what is meant by the line's genotype is unclear. Our model of re- 
sidual variation and our corresponding definition of "genotype" 
are directly based on the inbreeding scheme (Figs. 2, 3; Table 1). 
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Figure 1 . Experimental design for DGRP creation and sequencing. Each 
DGRP line was founded by a mated female collected from the Raleigh, 
North Carolina Farmer's Market. Each subsequent generation was created 
by crossing a pair of male and female progeny from the previous gener- 
ation. The DGRP lines were produced by 20 generations of full-sib in- 
breeding. For each line, high-throughput sequencing was performed on 
DNA that was extracted from a pool of 500-1 000 flies. 
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Figure 2. Schematic of model for site-specific joint genotypic inference. 
The philosophy of JGIL is for the model to mirror the experimental design. 
The founding females and their implicit mates are sampled from a com- 
mon population that is described by population-level parameters. The 
experimental design specifies the probabilistic dependency between the 
genotypes of these initial flies and those of the pooled inbred samples that 
are ultimately sequenced. Technical and bioinformatic errors are modeled 
as a site-specific phenomenon that applies equally to all lines. This allows 
"line effects" (i.e., genotypes) and "nucleotide effects" (i.e., errors) to be 
disentangled. 



Specifically, we take advantage of the fact that residual variation in 
the line is reflective of residual variation in the sib pair mated in 
generation 20 (excluding the contribution of new mutations) by 
relating the genotype of the line to those of this sib pair. We make 
two simplifying assumptions, namely, that (1) at most, two alleles 
are present among the progenitors of any one line (i.e., generation 
0, although we permit more than two alleles to be segregating in 
the initial population); and (2) the transmission of alleles from the 
G20 sib pair to their inbred line (see Fig. 2) is not distorted. These 
assumptions allow us to define 22 discrete "line genotypes" (see 
Table 1) and assign probabilities to each (see Methods) based on 
the inbreeding design and the composition of the Raleigh pop- 
ulation (as given by p below). 

Joint genotypic inference 

Restricting attention to one genomic site, consider m lines with 
genotypes G x , . . . , G m (collectively G) and respective sets of cov- 
ering reads Ri,...,R m (collectively R). Let 6 be a parameter vector 
that models aspects of the sequencing process and of the in- 
dividuals themselves. The advantage of joint inference lies in using 
all of the data Ri , . . . , R m to obtain a better estimate of 6 than would 
be possible when considering each individual separately. 

We analyze each genomic position separately, and the values 
taken by the parameter vector 0 are unique to each position. The 
vector Q = {p Al p Cl p Gl p Tl s Al sc 1 s Gl ST) = {p 1 s) includes eight param- 
eters. The entries of p sum to 1 and describe the position-specific 
nucleotide frequencies in the common population from which the 
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Figure 3. Detailed model and estimation framework for one line at one 
site. Frequencies of A, C, G, and T in the population govern the distri- 
bution of parental genotypes in generation 0 (AA x AC in the figure). The 
distribution of parental genotypes at generation 20, conditional on the 
genotypes at generation 0, is specified by the Markov chain described in 
Methods. It is assumed that this cross produces many offspring in the 
absence of segregation distortion so that the nucleotide frequencies 
among the offspring match those of the parents. The sequencing reads, 
which are the observed data, are composed of a random sample of these 
nucleotides (with replacement) along with errors whose frequencies are 
given by s. The unobserved, or missing, data are composed both of the 
parental genotypes at generation 20 (here G= 1 for AA x AA) (cf. Table 1) 
and indicators (7,- = 0,1) that record the error status of each read. For 
example, 7 3 = 1 in the figure because the T is an error, which is clear when 
the "genotype" G=l is known, provided mutation is precluded. While 
each line has its own observed and missing data, the global parameters p 
and s are common to all lines. 



models mapping error and context-dependent sequence error, but 
not variability in sequence quality due to read position. 

Estimation strategy 

We estimate 0 by maximum likelihood. The joint probability of the 
read data R across all lines is given by 

m 22 

Pr(R = r|0) = J] E Pr ^ = n\Gi = £,£)Pr(G z = *|p) 

z = l <?=! 

and we seek 6 = argmax 0 L(6|R) = argmax e Pr(R|6). We do so using the 
Expectation-Maximization algorithm (Dempster et al. 1977). We 
envision two layers of missing data, namely the genotypes G and 
error indicator variables for each read, which we will call Y (see Fig. 3; 
Methods). While the algorithm only guarantees the discovery of a lo- 
cal maximum of the likelihood surface, for this application it appears 
as though the global optimum is being found whenever the initial 
choice of 6° is reasonable. Nearly exact analytical solutions to the 
maximization step (i.e., 0 1 =argmax 0 E G Y , Re=e o[logPr(R, G, Y|0)]) 
are given in the Methods section. An analysis of the approximation 
error is given in Supplemental Figure 1. 

Maximum a posteriori genotypes 

Upon obtaining a maximum likelihood estimate 0 of the pa- 
rameter vector 0, the posterior probability of each genotype 
can be computed and is proportional to Pr(^=#|i? z = r z -,0j. 



Maximum a posteriori (MAP) 
gi = argmax Pr (G z =g\R t = r u e) . 



genotypes are assigned as 



Genotype quality scores 

Quality scores are assigned to each called genotype according to 
the phred scale (Ewing and Green 1998). Specifically if the prob- 
ability of the MAP genotype is P, then we report its quality as 
Q=-101og 10 (l-P). 



inbred lines were derived. The entries of £ describe the position- 
specific probabilities of obtaining a read with an erroneous base of 
A, C, G, or T, respectively. Thus, 0 parameterizes two generative 
probability distributions for each line v. (1) Pr(G z |0) =Pr(G z |p), 
which gives the distribution of the line's genotype G z given the 
population frequencies p; and (2) Pr(i? z |G z , 0) = Pr(i? z |G z , e), which 
gives the distribution of the line's reads Ri given its genotype G z and 
error profile £. Note that while s is specific to each genomic site, it 
does not vary with the position that reads cover a given site. Thus, £ 



Results 

We begin with a summary of how JGIL performs on the Illumina 
data generated as part of the DGRP project. We evaluated this 
performance by means of three independent comparisons. First, 
because 29 of the DGRP lines considered here were also sequenced 
on the 454 Life Sciences (Roche) platform, we were able to compare 
the consistency of genotype calls across technologies. Second, 
because a pair of duplicate lines was sequenced, we were able to 
compare the consistency of genotype calls across biological/tech- 



Table 1. Encoding of the 22 possible "genotypes," summarized as 10 states 
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Each entry M,j in the 4 x 22 matrix M records the number of / alleles (A = 1, C = 2, G = 3, T = 4) in genotype /. The 22 columns of M are what we 
call "genotypes" and are further collapsed into 1 0 states that describe the nucleotides segregating within a line at a site. The states are (1 ) A, adenine; (2) 
C, cytosine; (3) G, guanine; (4) T, thymine; (5) M, amino [A and C]; (6) R, purine [A and G]; (7) W, weak [A and T]; (8) S, strong [C and G]; (9) Y, pyrimidine 
[C and T]; (1 0) K, keto [G and T]. 
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nical replicates. Third, because a previous study used Sanger se- 
quencing to target five ~l-kb regions of the X chromosome in all 
of the DGRP lines, we were able to compare a subset of the JGIL 
genotype calls to this gold standard. 

Analysis of the DGRP data 

JGIL was used to genotype the 162 inbred D. melanogaster lines in 
Freeze 1 of the DGRP for which Illumina sequences were available 
(Mackay et al. 2012). A second set of 40 lines, including 29 of these 
162, was sequenced on the 454 platform, and from these JGIL 
produced a second set of genotype calls. The 29 lines sequenced on 
both technologies form the basis of our first comparison. 

Specifically, for each of the 29 lines, we genotyped —119 Mb 
of euchromatic DNA on the five major chromosome arms (X, 2L, 
2R, 3L, 3R). For each line at each site, the MAP genotype was 
reported as one of the 10 states in Table 1 provided it received 
a quality score of 20 and there was greater than zero coverage; 
otherwise, the genotype was assigned an "N." Genome-wide, 
when a genotype was reported for both technologies, the agree- 
ment between them was 99.97%. In only 0.33% of cases, the 
Illumina data produced an N when the 454 data yielded a called 
genotype. Owing in part to lesser coverage, the 454-based call was 
an N in 2.09% of cases when a call was made from Illumina data. 

The SNP sets from both technologies were also highly con- 
cordant. Both identified roughly the same number of SNPs (—2.8 
million), and the vast majority were in common (—2.5 million). 
Recognizing that a single genotyping error can yield a false-posi- 
tive SNP, we stratified the concordance between technologies 
according to minor allele frequency. As expected, the concordance 
is lower when the minor allele is rare; upon excluding the single- 
ton class, both technologies identify the same SNP well more than 
90% of the time. 

JGIL is unique in modeling the four nucleotide frequencies 
rather than the frequency of reference and alternate alleles. This 
allows the identification of sites at which three alleles are present 
among the lines, as well as sites at which there exist two alleles 
distinct from the reference. We found 30,456 such sites in com- 
mon between the two call sets, which represents >1% of all SNPs. 
Equally important, JGIL is able to make correct genotype calls in 
these unusual but not so infrequent cases. The incidence of such 
sites (30,456 out of 1 19,029,689, or 0.026%) is commensurate with 
the frequency of discordance between the two call sets (0.03%), 
suggesting that a failure to model them will substantially elevate 
the error rate. 

Overall, the concordance across technologies between geno- 
types and between SNP sets suggests that JGIL is performing well. 
As further validation, we next compared to a pair of putative rep- 
licates on the Illumina platform, namely, lines RAL_554 and 
RAL_555. RAL_554 and RAL_555 appear to be replicates of one 
another, but this was not established until the sequencing of both 
was complete. Comparing the genotype calls for these two lines 
thus provides a measure of the error rate one should expect due to 
sampling and technical variability. Irrespective of quality, the 
overall agreement between MAP genotypes was 99.95%. To gauge 
the effect of our quality threshold (Q = 20), we classified each site 
according to the minimum quality of the two genotype calls being 
compared. We observed that for 99.75% of all sites, both genotypes 
were assigned Q > 20, and among these sites the calls agreed 
>99.99% of the time. Among the remaining 284,693 sites com- 
prising the lower-quality class, the agreement was only 84.06%. 
This suggests that genotype quality scores empirically increase 



with the probability of the genotype being correct. To quantify this 
relationship, we looked at empirical quality as a function of esti- 
mated quality (again, the minimum of the two scores). The cor- 
relation between these quantifies the agreement between what is 
observed [empirical quality, measured as -101og 10 (l - accuracy)] 
and what is expected (based on JGIL posterior probability); as 
shown in Figure 4, that correlation is r= 0.98. That said, the slope of 
the line relating these two quantities is substantially <1 0 = 0.57), 
suggesting that JGIL quality scores are overestimating the empiri- 
cal quality by a predictable amount. For example, the threshold of 
Q = 20 [Pr(error) = 0.01] discussed above appears to equate to an 
empirical quality closer to 14 [Pr(error) = 0.04]. The strong linear 
relationship between the two (Fig. 4) suggests that recalibration 
should be possible provided a training subset of genotypes is 
known in advance. However, even in the absence of recalibration, 
it is clear that JGIL quality scores are very reliable indicators of 
accuracy. 

The comparisons above establish the reproducibility of JGIL 
genotype calls across technologies and replicates. Collectively, they 
suggest that JGIL is highly accurate, but neither constitutes valida- 
tion in the strictest sense. Thus, we turned to a third comparison 
based on targeted Sanger sequence data. Specifically, we obtained 
data for each DGRP line from five previously sequenced regions 
on the X chromosome (Arya et al. 2010). Taken together, these 
five regions (9,108,929-9,109,735 bp, 19,029,113-19,029,901 
bp, 20,287,749-20,288,836 bp, 20,289,014-20,289,991 bp, 
20,290,640-20,291,880 bp) yielded a validation set composed of 
nearly 5000 bp per line. 

Assuming the Sanger sequence data to be correct, we sought 
to quantify JGIL's performance as before. There were 607,776 line/ 
site combinations for which a Sanger call was available; the Illu- 
mina-based JGIL genotype was in agreement in 607,447 of these 
cases, yielding an accuracy of 99.95%. This includes JGIL calls with 
quality scores below 20; these were few, and their accuracy was 
78.79%. It is possible for a JGIL genotype call to attain a high 
posterior probability in the absence of any coverage. For example, 
if across many lines with nonzero coverage there appears to be 
no variation, JGIL would predict (via its probabilistic model) that 
lines with no coverage likely share a genotype with the covered 
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Figure 4. Empirical quality versus estimated quality. The genotype calls 
for replicate lines RAL_554 and RAL_555 were compared after stratifica- 
tion by minimum quality score (estimated quality, x-axis). Within each 
stratum, the proportion of sites for which the two calls agreed was cal- 
culated and converted to a quality (empirical quality, /-axis). The re- 
gression of empirical quality on estimated quality (black line) is highly 
significant (R 2 = 0.9596; p ~ 0). Quality scores were truncated at 40 and do 
not appear in the plot; the empirical quality among all sites with estimated 
quality Q > 40 was 47 (data not shown). 
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lines. The implicit assumption is that the 
absence of coverage is stochastic, but it is 
easy to imagine scenarios (e.g., a deletion) 
in which it is biological. We mitigate the 
latter case at the expense of the former by 
masking uncovered sites as N regardless 
of the probability of the MAP genotype. 
Between low-quality calls and uncovered 
sites, 6510 JGIL calls were masked in the 
validation set, representing close to 1% of 
the total. Among these, had we instead 
reported the MAP genotype, it would have 
been correct according to the Sanger data 
nearly 99% of the time. 

Thus far we have assumed the vali- 
dation set to be correct. However, because 
JGIL appears to be highly accurate, the 
Sanger error rate cannot be ignored. Be- 
cause we expect this rate to be low, errors 
in the validation set are unlikely to distort 
measurements of genotype accuracy. But, 
because a single error in any line is 
enough to falsely classify a site as variable, 
even small error rates can strongly distort 
statistics regarding SNP detection. Rea- 
soning that most errors would transform 
invariant sites to singleton SNPs, we re- 
stricted attention to variable sites in the 
validation set for which two or more lines 
harbored the less frequent allele. There 
were 90 such sites present, and JGIL iden- 
tified 83 of them. This suggests a false- 
negative rate of just below 8%, which 
may in part explain why some SNPs in 
the initial comparison were private to 
one of the two technologies. False posi- 
tives will also contribute, and we can 

quantify their incidence as well. Among the 4757 monomorphic 
sites in the Sanger data, JGIL identified 10 SNPs, which equates to 
a false-positive rate of 0.2%. 

Each of the above comparisons is evidence that JGIL is per- 
forming well. Sometimes this performance is the result of a clear 
signal in the mapped reads; other times, however, JGIL is able to 
reason through data that are, in qualitative terms, confusing. 
Generally speaking, three features allow JGIL to accommodate 
unusual and aberrant sites. First, because JGIL estimates a site- 
specific allele frequency vector p, it is able to accommodate sites at 
which more than two alleles are segregating within the sample. 
This cannot be accomplished in a biallelic model in which only 
ancestral and derived bases are considered. Second, because JGIL 
estimates a site-specific error probability vector £, it can properly 
genotype sites whose reads are contaminated with a low level of 
mapping error. An example of this is given in Figure 5A. Shown is 
a summary of the Illumina and 454 read data at position 8802 on 
chromosome 2L, from which it is apparent that 18 of the 28 lines 
have Illumina coverage for a C nucleotide that is not corroborated 
by 454. JGIL estimates from the Illumina data that p= (1, 0, 0, 0) 
and £ = (0, 0.067, 0, 0), essentially reasoning out that every C is an 
error. For every line, the posterior probability of A is effectively 1, 
and JGIL calls this site as invariant. 

In more extreme cases, such as that of position 18818 on 
chromosome 3R, the degree of error in the data may be too much 
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Figure 5. Two examples of JGIL applied to the DGRP. (A) Data for chromosome 21, position 8802 
generated on two sequencing platforms for 28 DGRP lines. Each vertical bar summarizes the data for one 
line from both the Illumina GAM (oriented upward) and the Roche 454 machines (oriented downward). 
The height of the bar indicates coverage and is partitioned into counts of A (gray) and C (black). Each bar 
is labeled with the index of its corresponding line. Note that every 454 read shows an A, strongly 
suggesting that the Illumina C reads are erroneous. (B) As in panel A for chromosome 3R, position 
1 881 8. Based on the Illumina data, JGIL assigns to four of the lines (365, 399, 705, 71 4) a homozygous C 
genotype, but this is not supported by the 454 data. 



for JGIL to overcome. As Figure 5B shows, the C nucleotide is 
ubiquitous among the lines, and for several of the lines there is 
no coverage of A at all. On the other hand, whenever reads with 
A are present for a line, they are present in greater number 
than those with C. JGIL estimates from the Illumina data that 
p=(0.87,0.13,0,0) and £=(0,0.24,0,0), indicating that there is 
some confusion over the presence of C (as evidenced by the dot 
product of p and £ being substantially positive). For the four lines 
that have no coverage for A (365, 399, 705, 714), JGIL makes a C 
call, although there is some posterior support for an A call in each 
case. For the remaining 24 lines, JGIL calls an A, which the 454 data 
suggest is probably correct. Importantly, despite the conflicting 
signals in the data, the method is not misled into calling rampant 
residual heterozygosity at this site. Thus, while the contamination 
is sufficient to mislead JGIL on a subset of the genotype calls, the 
estimated error probabilities can rescue the calls of lines for which 
the mapping error is lesser. This is the third feature of JGIL: Because 
the experimental design is modeled through Pr(G|p), the method 
is less prone to calling joint genotypes that are wildly inconsistent 
with the expectation of 20 generations of inbreeding. JGIL will still 
identify rare sites at which there appears to be residual heterozy- 
gosity in many lines, but for it to do so, the signal must be strong 
and not more easily attributable as mapping error. 

To explore these observations formally, we turned to a series 
of simulation studies. In what follows, we illustrate and quantify 
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how the performance of JGIL responds to varying coverage, sequenc- 
ing error, mapping error, and the number of lines in the analysis. 

Simulation results 

We first show how genotyping error rate decreases as the number of 
lines being genotyped grows. We then demonstrate the robustness 
of JGIL to moderate mapping error. Finally, we interrogate the utility 
of JGIL for population genomics by quantifying its ability to ascer- 
tain SNPs and estimate allele frequencies. Throughout this section, 
we use a Poisson model of coverage and uniform model of enor. 

We begin by illustrating the quantitative benefits of joint 
genotyping. Figure 6 plots genotyping error rate against sequenc- 
ing error rate for nine combinations of coverage (two, five, 10) and 
number of lines (two, five, 10). For a given sequencing error rate 
and fixed coverage, it is clear that the genotyping error rate de- 
creases as the number of lines increases. This decrease is directly 
attributable to the simultaneous inference of allele frequencies and 
error probabilities; moreover, it becomes more dramatic at higher 
sequencing error rates. Intuitively, JGIL borrows strength across lines 
to disentangle signal (i.e., genotypes) from noise (i.e., error). This 
feature becomes even more pronounced in the face of mapping error. 

Previously, Figure 5 summarized two DGRP sites for which 
estimating error probabilities across lines led to robust genotype 
calls despite considerable mapping error. Supplemental Figure 2 
presents a simulation study of the same phenomenon in which 
mapping error was simulated for varying numbers of lines and 
coverage. Supplemental Figure 2 A plots the genotyping error rate 
observed at coverage 2. The frequency of mismapped reads was 
set as a percentage of the expected coverage for the true allele; 
for example, 50% represents an expected coverage of 1 for the 
erroneous base. Remarkably, even at minimal coverage, five lines 
appear to be sufficient to mitigate substantial mapping error 
(Supplemental Fig. 2B). Just as in Figure 5, JGIL is able to proba- 
bilistically identify the mismapped nucleotides and diminish their 
contribution to the genotype calls. For absolute levels of mapping 
error, e.g., comparing 50% enor at coverage 2 to 20% error at cov- 
erage 5 and 10% error at coverage 10, higher coverage of the true 
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Figure 6. Genotyping error rate as a function of sequencing error rate, coverage, and sample size. At 
a single site, sequencing read data were simulated for either two (square), five (circle), or 1 0 (triangle) 
lines assumed to be homozygous for the nucleotide A. These were considered in combination with 
simulated coverage of either two reads (red), five reads (blue), or 1 0 reads (black). For sequencing error 
rates ranging from 0.01 to 0.20, the JGIL genotyping error rate was calculated across 1 0,000 replicate 
simulations. 



allele leads to a lower genotyping error rate. If, however, the mis- 
mapped reads are derived from another location in the sequenced 
genome, then it may be more reasonable to compare across relative 
levels of mapping error. Here, the relationship appears to be more 
complicated. At least under a naive Poisson model, a high relative 
mapping error, say 90%, has the potential to be more problematic 
when the coverage is higher. Faced with the consistent signal of two 
nearly equally frequent alleles, JGIL will assume that there is ubiq- 
uitous residual heterozygosity at the site. This scenario occurs more 
often at coverage 10 than at coverage 2, leading to the seemingly 
counterintuitive result shown in Supplemental Figure 2. 

Thus far we have focused on individual call quality by attempt- 
ing to quantify genotyping error rates as a function of specified 
covariates. We next sought to quantify the performance of JGIL on 
a per-site basis rather than on a per-call basis, with the goals of SNP 
detection and allele frequency estimation in mind (Lynch 2009). 
We used simulation to characterize SNP ascertainment bias, as well 
as to measure any bias in allele frequency estimates, as a function of 
the true allele frequency, the enor rate, and coverage. The first two 
rows of Supplemental Figure 3 summarize a study in which 100 lines 
were genotyped from reads of average depth two (left column), five 
(middle column), and 10 (right column). The top row reports how 
often among 10,000 replicates a SNP went undetected given 
a specified error rate (between 0.01 and 0.1 inclusive) and a speci- 
fied minor allele frequency (between 0.01 and 0.1 inclusive, cor- 
responding to between one and 10 lines). It is evident and not 
surprising that singletons are difficult to detect at low coverage. It 
is also clear, however, that either an increase in coverage or an 
increase in minor allele frequency is sufficient to allow rare SNPs to 
be detected. As the figure shows, this holds true even as the error 
rate increases. The middle row of the figure quantifies bias in allele 
frequency estimation under the same simulation conditions. 
Again, for low coverage, the bias is appreciable, but here the bias 
increases with minor allele frequency. This observation is artifac- 
tual in that, in the absence of covering reads, there is more a priori 
support and hence more a posteriori support for the major allele 
than for the minor allele. If one filters on either posterior probability 
or realized coverage, then the trend toward increasing bias goes away. 

The bottom row of Supplemental 
Figure 3 contrasts the remainder of the 
figure in that there is assumed to be no 
minor allele. Thus, whereas the top row 
concerns false negatives, the bottom row 
considers false positives. Each simulation 
specified the expected coverage (two, 
five, 10), the error rate (from 0.01 to 0.10), 
and the number of lines to be analyzed 
(from five to 50); for each combination, 
o ° the figure reports the proportion of 

o ° □ 10,000 replicate simulations in which 

JGIL mistakenly called a SNP. It is clear 
D v v v v that there is an appreciable false-positive 
o o g rate at low coverage for moderate sequenc- 

I % % t I ing error. Moreover, this error rate evidently 

_j i increases with the number of lines. As the 

016 0-18 °' 20 number of lines grows, more chances 
arise for a line to be covered exclusively 
by erroneous alleles. The probability of 
such an event is higher at low coverage 
and higher as the error rate increases. The 
effect is similar to that of the mapping 
example in Figure 5B for lines 365, 399, 
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705, and 714; if the covering reads are exclusively wrong, JGIL cannot 
help but be misled. On the other hand, from the figure, it appears 
that, for moderate coverage, sporadic sequencing error does not 
contribute much to the false-positive rate. But note that the same 
cannot be said for systematic errors (e.g., mapping or sequencing 
chemistry bias) such as those considered in Supplemental Figure 2. 

Discussion 

Improvements in sequencing technology are facilitating in- 
creasingly deep studies of genomic variation. Sometimes the unit of 
study is not a single individual but rather a strain within which the 
genetic diversity is minimal. To capture within-strain variation, it is 
common to pool the DNA of many individuals for sequencing, cre- 
ating a mixture of genotypes that represent the strain. It was in this 
context that we developed JGIL to identify variation among a large 
panel of strains. We introduced a framework for simultaneously 
genotyping multiple strains by jointly analyzing their sequence reads. 
We showed how the nature of the data and the design of the exper- 
iment could be incorporated into a single probabilistic model. We 
implemented this model as JGIL for the DGRP, and we showed that it 
produced highly accurate genotypes in the absence of heuristics. 
Despite this focus, our framework addresses concerns that are general 
and fundamental to the sequencing of inbred lines and strains. JGIL 
was tailored to the DGRP only in how the probability Pr(G|0) was 
structured; the number of generations can already be varied, and 
extensibility to other experimental designs is straightforward. 

Because JGIL did not use heuristics on top of its probabilistic 
model, its treatment of sequence data is both sophisticated and 
naive. Sophisticated aspects of the method include its treatment 
of experimental design and its approach to modeling nucleotide 
frequencies and errors. These conferred several benefits in the DGRP 
analysis. First and foremost, JGIL was not misled into overestimating 
the number of sites at which each line harbors residual variation. 
The incidence of "heterozygotes" would have been much higher 
had the lines been viewed as individuals, had they been considered 
in isolation, or had they been considered in the absence of the ex- 
perimental design. Second, JGIL was able to identify and accurately 
assign genotypes at sites where three alleles were present between 
the lines and the reference. Third, JGIL appears able to mitigate some 
degree of mapping enor, and its parameters can be used to quanti- 
tatively flag sites where the data appears qualitatively confusing. 

Other aspects of JGIL may certainly be considered naive. For 
example, JGIL considers each site in isolation, thereby disregarding 
any information encoded in neighboring sites. It is well appreci- 
ated that haplotype information generally has the potential to 
improve genotyping accuracy (see, e.g., Nielsen et al. 201 1). On the 
other hand, just as the meaning of a strain genotype is compli- 
cated, so too is the meaning of a strain haplotype, and when DNA 
is pooled before short-read sequencing, much of the haplotype 
signal is lost. JGIL may also be considered naive in its ignorance of 
other types of genetic variation such as indels and copy number 
variants. Our approach takes as input the result of a reference- 
guided assembly, and as such its ability to ascertain genotypes is 
predicated on the fidelity of mapped reads. Indels complicate 
mapping in several ways, leading to both false negatives (e.g., 
proximal nucleotide variants missed because of a failure to map 
correctly) and false positives (e.g., artificial nucleotide variants 
created by reads that are incorrectly mapped). This is a limitation 
that JGIL shares with any method that conditions on the mapped 
reads; however, it may be mitigated somewhat by applying post 
hoc filters to flag sites where the alignment appears dubious. 



We believe that the naive aspects of JGIL are more than 
compensated for by its sophistications. In particular, we emphasize 
the importance of modeling the data and the experimental design, 
and we advocate that when possible, Pr(G|0) should be specified 
with these issues in mind. None of this is to the exclusion of 
complementary approaches that improve data quality and/or in- 
ference. In our application to the DGRP, for instance, we relied on 
the GATK package to improve the JGIL input, and we certainly 
envision that others will apply post hoc filters to the JGIL output. 
Our purpose here was to introduce a probabilistic model specifi- 
cally designed to genotype a panel of strains and to show that it 
could be implemented to achieve highly accurate genotype calls. 

Methods 

Defining and encoding genotypes 

We define the genotype of a line as the allelic content of its final 
full-sib mated pair (see Fig. 1). For example, if in one line both of 
these flies were AA homozygotes (as one might expect after 20 
generations of inbreeding), we call and code the genotype for the 
line as (4,0,0,0), meaning 4 A alleles and 0 alleles of C, G, and T 
respectively. Alternatively, if this last generation of full-sib mating 
features an AC X AA cross, the genotype is then (3,1,0,0). As 
a simplification, we have reduced the number of possible line ge- 
notypes to 22 (see Table 1) by assuming that no more than two 
nucleotides will be represented in this cross for any one line. We 
denote the 4 X 22 matrix of possible genotypes in Table 1 as M. 

Sequencing and data preprocessing 

Each line was sequenced on the Illumina platform to at least 12 X 
coverage; reads ranged from 36 bp to 110 bp in length. Lines se- 
quenced on the 454 machine had a minimum of 5 X coverage with 
the majority above 10 X, with reads ranging from 200 bp to 400 bp 
in length. Details for each line are given in Mackay et al. (2012). The 
bwa software (Li and Durbin 2009) was used to align the sequencing 
reads for each line to the D. melanogaster reference sequence 5.13 
(obtained from FlyBase). The GATK package (McKenna et al. 2010) 
was used to recalibrate and locally realign the bam files; recalibra- 
tion was seeded with a liberal list of putative variants obtained via 
AtlasSNP (Shen et al. 2009). Reads with a mapping quality below 
10 were discarded prior to variant calling. Bases with base quality 
below 25 were also removed from consideration. DGRP commu- 
nity resources, including the lines, sequences, read alignments, 
and SNPs are publicly available as detailed in Mackay et al. (2012). 

Modeling the inbreeding process through Pr[G|p) 

Full-sib inbreeding is an iterative sampling procedure in which the 
genotypes at generation n + 1 depend on those in generation n. The 
process follows a Markov chain whose initial state, the genotypes 
at generation 0, has a probability distribution specified by the 
population allele frequency vector p (Robertson 1952). For ex- 
ample, the probability that the G 0 cross is AA X AA is simply 
p\, while the probability that the Gi cross is AA X AA is 
Pa + Pa(^ ~ Pa) + - Pa) 2 - The latter expression is more 

complicated because an AA X AA G x cross can result from three G 0 
genotypic configurations: (1) AA X AA, (2) AA X AX, or (3) AX X 
AY, where "X" and "Y" represent any nucleotides other than A. The 
complexity of these probabilities coupled with our desire for an 
efficient implementation led us to make some benign simplifying 
assumptions. In particular, because the chance of any one line 
retaining more than two alleles at a site after 20 generations of 
inbreeding is negligible, we made the assumption that there exist 
no more than two alleles in the G 0 parentals. 
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Under the assumption that no more than two alleles are 
present in the G 0 of any one line, the number of possible genotypic 
configurations is 22 (enumerated in Table 1). Once the two alleles 
have been specified, say A and a, only six of the 22 configurations 
are relevant: (1) AA X AA, (2) AA X Aa, (3) AA X aa, (4) Aa X Aa, (5) 
Aa x aa, and (6) aa x aa. The initial probabilities of these states 
depend on p, and the distribution of parental genotypes at gen- 
eration n + 1 given those at generation n can be represented by a 
6x6 Markov transition matrix Q: 



Q= 



1 0 0 0 0 0 

1/4 1/2 1/4 0 0 0 

0 0 0 1 0 0 

1/16 1/4 1/8 1/4 1/4 1/16 

0 0 0 1/4 1/2 1/4 

0 0 0 0 0 1 



This says, for example, that the probability that the G 5 cross is 
AA X Aa given that the G 4 cross is Aa X Aa is Q 4 2 = 1/4. The dis- 
tribution of parental genotypes at G 20 given the distribution at G 0 
(i.e., the founding parents sampled from the population) can be 
found from Q 20 . 

The Markov chain {X n : n = 0, 1, . . . , 20} described by Q has 
two absorbing states, namely, state 1 (AA X AA) and state 6 (aa X 
aa). In other words, if the G 0 parents are homozygous for the same 
allele so that the Markov chain begins in an absorbing state, 
then the chain will remain in the same state through Xzo. Let 
r = min n {X n = 1 or X n = 6} denote the first time that the chain en- 
ters an absorbing state. Then Pr(r > 20|X 0 =s) is the probability 
that residual heterozygosity remains after 20 generations of in- 
breeding when the G 0 parents have genotypes described by state s. 
It turns out that Pr(X 20 = 2|r > 20, Xq = s e {2,3,4,5}) 9* 3/10 in- 
dependent of the initial state. Similarly, the values for X 2 o = 3,4,5 
are 1/20, 7/20,3/10, respectively. We use the full-sib inbreeding 
coefficient F 2 o to approximate Pr(r > 20) independent of the initial 
state. F 20 can be calculated using the recurrence relation 
F t+2 = 0.25 + (0.5)F t+ i + (0.25)F t , from which F 20 = 0.9863 is obtained 
(Crow and Kimura 1970). 

Using these approximations, we can calculate Pr(G; =g\p) for 
each of the 22 genotypes shown in Table 1 through 

Pr(X 20 = s|p) = £ Pr(X 20 = s\X 0 = z,r > 20) 

z=l 

XPr(r > 20|X 0 = z) Pr(X 0 = z|p) 
+ £ Pr(X 20 = s\X 0 = z,r<20) 

z=l 

X Pr(r < 20|X 0 = z) Pr(X 0 = z|p) 

We approximate these probabilities with the v defined in the 
section entitled "Updating equations for the EM." 

Modeling the sequencing process through Pr[R|G,£) 

Recall that there are N z - reads for line i. Assuming some arbitrary 
order for the reads, let R\ denote read k for line i. Then 

Ni 

Pi(Ri = n\N h Gi = g,e) = IIPr(^ = rf|G« = g,e) 
k=l 



Ni 4 

nn 

k = l ;=1 



Pr(r! < =/', no error) 



Pr(r^ =/', error) 



where s. is used to denote the sum of the entries in £. Note that we 
have explicitly modeled the realized coverage N z - as fixed rather 
than random. 



Data augmentation and the expected log-likelihood 

We have described a probabilistic model for Pr(R|p, e), where R 
is the 4 X m matrix of read counts at a site across lines and p and £ 
are the site-specific population frequencies and error probabilities, 
respectively. To estimate the parameters via maximum likelihood, 
we must find (p,e) = argmax pf Z(p, e|R) = argmax p £ Pr(R|p, e), 
which by direct maximization would be challenging. Instead, 
we appeal to the missing data previously described. Specifically, 
we take as missing data the unobserved 1 x m vector of geno- 
types G and the m unobserved indicator vectors Y 1 of length 
1 X N u 

We can write 



Pr(R = r|p, £ ) = £ Pr W" = r ^ = S^)MGi = *|p). 

z = l g=l 

Now, Pr(G/ =^|p) is a function of the inbreeding design, and its 
derivation is detailed above. The remaining term Pr(i?/ = r/|G; =g, s) 
can be expressed with the help of the read indicators as we now 
describe. For & = 1 , . . . , N if define 



n= 



0, if read k of line i has the correct base 



have 



k ^ 1, if read k of line i has an error 
Then upon augmenting the read data with the indicators, we 



Ni 4 



k=lj=l 



X I 



p r ^. = nj^i = g ,8) = nn - s - + £ / 



i-n 



( £ - - £ /) ) 1 {rf=;} 



Augmenting with read indicators and genotypes, we have 

N ' 4 'Mft 
4 



Pr(R = r, G, Y|p, e) = f[ Pr(G,- = gi \p) f[ j[ [ ^ (1 - e. + s, 



k=l j=l 



X 



((■ 



( £ -- £ /)j 1 {r\=j} 



So the augmented log-likelihood is 

m 

/(e|R=r,G,Y)= XlogPr(G ( =*,|p) 



(=1 k=l ; = 1 

+ Y> 



M k 



(i-n)iog (!-«.+«,; 



ilog((l-^)( 8 .-^))]l {rf . /} 



and for the EM we seek the parameter values for p, £ that maximize 
its expectation with respect to the distribution G,Y|R, 6*. Solu- 
tions to the maximization are given below. 



Updating equations for the EM 

Below we describe how our estimate of 6 is updated from 6° to 0 1 in 
one iteration of the EM. Let 1 denote a vector or matrix of ones, 
and let I denote the identity matrix. We use p° and £° to denote the 
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previous estimates in 0° and treat each as a 4 X 1 column vector. 
The operator o denotes the Hadamard product of two matrices. 

Let F 2 o be as described above and define the 22 X 1 vector 
of genotypic probabilities v as follows. For z = l,...,4, v,- = 
(p?) (1 - F 2 o) +p- ) F 2 o. These are the homozygous probabilities 
based on the nucleotide frequencies in the initial population. For 
segregating A and C, we use v 5 =v 7 = (3/5)pJp§(l - F 2 o) and 
v 6 = (4/5)pJp§(l - F 20 ); v 8 through v 22 are specified similarly for 
the remaining segregating pairs (see Table 1). Note that the co- 
efficients 3/5 and 4/5 arise from the X 20 probabilities from the 
Markov chain described above. 

Let A= (l 4 x22-|M)o£l lx22 andB= (±M)°((l 4x i-(l4x4- U)e) 
11x22) be 4 X 22 matrices whose rows and columns index nucle- 
otides (A, C, G, T) and genotypes (1-22), respectively. Let J be the 
L X 22 matrix whose entries are defined as 

Jv = v,-n(A* / + B k/ .) R ". 

k=l 

Let H be the m X 22 matrix obtained from J by normalizing 
each of its rows to sum to one (i.e., 



H« 



J*/ \ 

^22 , >• 



The m rows of H correspond to each of the m inbred lines and 
report posterior probabilities for each of the 22 genotypes (see 
Table 1). The 1 X 22 vector \\ x ffl H contains the expected number 
of lines of each genotype. Let K be the 4 X 22 matrix obtained 
from M through K /; - = TM/y/31 . Then the updated 4x1 vector of 
nucleotide frequency estimates p* is given by 



P = 



KH 1„ 



\\ x 4KH T l m x 1 



Finally, let S be the 4 X 22 matrix whose entries are 



0, 



if A jy + B J y>0 
otherwise 



Then the updated 4x1 vector of error probability estimates £* is 
given by 

# _ (So(RH)) 122x1 



; 4 R1^ 



Software access 

JGIL is available for download at http://www4.ncsu.edu/~eastone2/ 
software. 
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